In this project, the open-source R programming language is used to model the progression of the COVID-19 pandemic in different U.S. counties. R is maintained by an international team of developers who make the language available through The Comprehensive R Archive Network (CRAN). Readers interested in reusing our code and reproducing our results should have R installed locally on their machines. R can be installed on a number of operating systems (see Windows, Mac, and Linux for the corresponding installation instructions). We also recommend the RStudio interface for R, which can be downloaded for free by following the instructions at the link. For non-R users, we recommend Hands-On Programming with R for a brief overview of the software’s functionality. Hereafter, we assume that the reader has an introductory understanding of the R programming language.
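Before running any of the code chunks below, readers may wish to verify their local setup. The snippet below is a minimal sketch: the two package names are taken from this project's library list, and everything else is base R.

```r
# Minimal environment check before running the analysis code
R.version.string                    # prints the locally installed R version

# Ensure the two bootstrap packages used in the next chunk are available;
# install from CRAN any that are missing
needed  = c("pacman", "devtools")
missing = needed[!sapply(needed, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```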
In the code chunk below, we load the packages used to support our analysis. Note that this (and any other) code chunk can be hidden by clicking on the ‘Hide’ button to facilitate navigation.
if(require(pacman)==FALSE) install.packages("pacman") # check to see if the pacman package is installed; if not install it
if(require(devtools)==FALSE) install.packages("devtools") # check to see if the devtools package is installed; if not install it
# check for these GitHub-hosted packages and install them if not found locally
if(require(albersusa)==FALSE) devtools::install_github('hrbrmstr/albersusa') # install package if needed
if(require(r2d3maps)==FALSE) devtools::install_github('dreamRs/r2d3maps') # install package if needed
# check for the remaining CRAN packages; install any that are missing
pacman::p_load(tidyverse, lubridate, magrittr, dataPreparation, recipes, doParallel, psych, skimr, VIM, reshape2, # for data analysis; lubridate is needed for wday() below
COVID19, rvest, readxl, # to get county level COVID19 data, scrape Wikipedia, and load xls files
DT, stargazer, # used for having nicely formatted outputs in Markdown
zoo, fpp2, dtwclust, factoextra, TSclust, proxy, NbClust, # for TS analysis and clustering
DataExplorer, scales, ggdendro, gridExtra, RColorBrewer, raster, GGally,# for plotting
leaflet, webshot, albersusa, tigris, plotly, r2d3maps,# for mapping
caret, caretEnsemble, AUC, caTools, DMwR, MLmetrics, # ML packages
rpart, nnet, naivebayes, # rpart, nnet (which provides multinom), and naivebayes
conflicted)
# Handling conflicting function names from packages
conflict_prefer('combine', 'dplyr') # Preferring dplyr::combine over any other package
conflict_prefer('select', "dplyr") #Preferring dplyr::select over any other package
conflict_prefer("summarize", "dplyr") # similar to above but with dplyr::summarize
conflict_prefer("filter", "dplyr") # Preferring filter from dplyr
conflict_prefer("dist", "stats") # Preferring dist from stats
conflict_prefer("as.dist", "stats") # Preferring as.dist from stats
set.seed(2020) # to assist with reproducibility
sInfo = sessionInfo()
For our analysis, we fuse data from multiple sources. We describe the process of obtaining and merging each of these sources in the subsections below.
In this section, we utilize the COVID19 package (Guidotti & Ardia, 2020) to obtain county-level COVID-19 data (e.g., cumulative confirmed cases and deaths, along with policy-measure variables).
From this information, we have also computed the new daily and weekly confirmed cases/deaths per county. The data is stored in a tidy format, but can be expanded to a wide format using pivot_wider() from the tidyr package (part of the tidyverse).
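As a quick illustration of this tidy-to-wide expansion, consider the hypothetical toy data frame below; the column names mirror those of the counties object, but the values are made up purely for illustration.

```r
library(tidyr) # pivot_wider() lives in tidyr, which is attached with the tidyverse

# A toy long-format ("tidy") frame: one row per county id per date
tidyToy = data.frame(id        = c("a", "a", "b", "b"),
                     date      = rep(c("2020-03-01", "2020-03-02"), times = 2),
                     confirmed = c(1, 3, 0, 2))

# Expanding to wide format: one row per id, one column per date
wideToy = pivot_wider(tidyToy, names_from = date, values_from = confirmed)
wideToy
```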
counties = covid19(country = "US",
level = 3, # for county
start = "2020-03-01", # First Sunday in March
end = "2020-10-24", # end Date
raw = FALSE, # to ensure that all counties have the same grid of dates
amr = NULL, # we are not using the apple mobility data for our analysis
gmr = NULL, # we are not using the Google mobility data for our analysis
wb = NULL, # world bank data not helpful for county level analysis
verbose = FALSE)
counties %<>% # next line removes non-contiguous US states/territories
filter(!administrative_area_level_2 %in% c('Alaska', 'Hawaii', 'Puerto Rico', 'Northern Mariana Islands', 'Virgin Islands')) %>%
fastFilterVariables(verbose = FALSE) %>% #dropping invariant columns or bijections
filter(!is.na(key_numeric)) %>% # these are not counties
group_by(id) %>% # grouping the data by the id column to make computations correct
arrange(id, date) %>% # to ensure correct calculations
mutate(day = wday(date, label = TRUE) %>% factor(ordered = F), # day of week
newCases = c(NA, diff(confirmed)), # computing new daily cases per county
newDeaths = c(NA, diff(deaths)) ) # computing new daily deaths per county
# manually identifying factor variables
factorVars = c("school_closing", "workplace_closing", "cancel_events",
"gatherings_restrictions", "transport_closing", "stay_home_restrictions",
"internal_movement_restrictions", "international_movement_restrictions",
"information_campaigns", "testing_policy", "contact_tracing")
counties %<>% # converting those variables into character and then factor
mutate_at(.vars = vars(any_of(factorVars)), .funs = as.character) %>%
mutate_at(.vars = vars(any_of(factorVars)), .funs = as.factor)
cat(paste0("At this stage, we have only read the data from the COVID19 package. The resulting data is stored in an object titled counties, which contains ", nrow(counties), " observations and ",
ncol(counties), " variables. Note that we have filtered out observations that do not have a numeric key and removed some columns that do not add any value to future analysis (e.g., invariant columns)."))
At this stage, we have only read the data from the COVID19 package. The resulting data is stored in an object titled counties, which contains 739704 observations and 24 variables. Note that we have filtered out observations that do not have a numeric key and removed some columns that do not add any value to future analysis (e.g., invariant columns).
In the code chunk below, we merge our counties’ COVID data with seven additional datasets:
Rural/Underserved Counties: From the Consumer Financial Protection Bureau, we obtained the Final 2020 List titled Rural or underserved counties. Per the website, the procedure for determining a county’s classification is as follows: “Beginning in 2020, the rural or underserved counties lists use a methodology for identifying underserved counties described in the Bureau’s interpretive rule: Truth in Lending Act (Regulation Z); Determining “Underserved” Areas Using Home Mortgage Disclosure Act Data.”
Based on the US Census data, we extracted the land area in square miles for each county and combined it with population to compute each county’s population density, which we hypothesize to be predictive of hotspots for COVID-19 transmission based on the available COVID-19 literature.
Based on Data & Lab (2018), we obtained the voting results for all counties in the 2016 Presidential election. We used the data to compute the percentage of total votes that went to President Trump, with the underlying hypothesis that the politicization of the COVID-19 response (e.g., perception of/willingness to use face masks, policies, and the population’s reaction to the disease) may be explained by party affiliation.
Based on Wikipedia’s Table of State Governors (click for the permanent link to the version we scraped), we scraped each Governor’s party affiliation, since we hypothesized that it may impact the type of policies used at the state level. Given that the District of Columbia does not have a governor, when we merged our results back into the counties data frame, we knew (and confirmed) that this would result in NAs. These were the only NAs in that merged column, and we imputed them with “Democratic” since D.C.’s Mayor is a Democrat.
Based on the following Kaiser Health News webpage, we extracted county-level information on: (a) the number of ICU beds per 10,000 residents; (b) the percent of the population aged 60+; and (c) the number of ICU beds per 10,000 residents aged 60+.
We have also engineered a region variable based on the CDC’s 10 Regions Framework. While geographic regions are hypothesized to be a factor in disease outbreaks, we chose to utilize the CDC regions specifically based on the following explanation from the aforementioned link:
> “CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) is strengthening the consistency and quality of the guidance, communications, and technical assistance provided to states to improve coordination across our state programs”
Based on the Census Bureau’s Small Area Income and Poverty Estimates (SAIPE) Program, we extracted the estimated percent of the population in poverty. The estimate is based on 2018 data (released in December 2019), which were the most up-to-date publicly available data at the start of our analysis.
# [A] Rural or Urban Classification of the County
ru = read.csv("https://www.consumerfinance.gov/documents/8911/cfpb_rural-underserved-list_2020.csv")
ru %<>% transmute(key_numeric = FIPS.Code, #renaming FIPS.Code to key_numeric
countyType = "Rural/Underserved") # creates two vars and drop old vars
counties = merge(counties, ru, by = "key_numeric", all.x = TRUE) # merging the data back to counties
counties$countyType %<>% replace_na("Other") # for any county not in the Consumer FIN data replace NA by Other
# [B] Population Density of Each County
download.file("https://www2.census.gov/library/publications/2011/compendia/usa-counties/excel/LND01.xls",
destfile = "../Data/LND01.xls", mode = "wb") # downloading Land Area Data Per the 2010 Census
areas = read_excel("../Data/LND01.xls") %>% # reading the Excel file
select(STCOU, LND110210D) #selecting only the FIPS and the Land Area from the 2010 Census variables
colnames(areas) = c("key_numeric", "LandAreaSqMiles2010") # Renaming the columns
areas$key_numeric %<>% as.numeric() # to remove leading 0
counties = merge(counties, areas, by ="key_numeric", all.x = TRUE) # adding the area to data frame
counties$popDensity = counties$population / counties$LandAreaSqMiles2010 # creating the population density variable
counties %<>% select(-c(population, LandAreaSqMiles2010)) #dropping two variables used in creating pop density
# [C] 2016 Presidential Elections County Data from Harvard https://doi.org/10.7910/DVN/VOQCHQ
elections = read.csv("../Data/countypres_2000-2016.csv") %>% # reading the downloaded CSV
filter(year == 2016 & party == "republican") %>% # just keeping data for recent election and republican votes
mutate(key_numeric = FIPS, # renaming FIPS to key_numeric
percRepVotes = 100*(candidatevotes/totalvotes) ) %>% # computing percent of republican votes (from total votes)
select(key_numeric, percRepVotes) # keeping only the key and variable used in merge
counties %<>% merge(elections, by = "key_numeric", all.x = TRUE) # merge with the counties data
# [D] Party of State Governor
stateGovernor = "https://en.wikipedia.org/w/index.php?title=List_of_United_States_governors&oldid=977828843" %>%
read_html() %>% html_node("table:nth-child(9)") %>% html_table(header = 2, fill = TRUE) # scraping data
colnames(stateGovernor) = stateGovernor[1,] # making first row to be the header
stateGovernor = stateGovernor[-1, c(1,5)] # dropping first row
stateGovernor$Party %<>% str_replace('Democratic–Farmer–Labor', 'Democratic') # replacing Democratic–Farmer–Labor w/ Democratic
stateGovernor$Party %<>% str_replace(fixed('Republican[note 1]'), 'Republican') # fixed() treats the brackets literally, not as a regex character class
colnames(stateGovernor) = c('administrative_area_level_2', 'governorsParty') # renaming columns
counties %<>% merge(stateGovernor, by = 'administrative_area_level_2', all.x = TRUE) # merge
counties$governorsParty %<>% replace_na("Democratic") # 203 NAs are for District of Columbia which has a Democratic Mayor
# [E] Kaiser Health News Data on the County Level
hospitals = read.csv("../Data/data-FPBfZ.csv") %>% # downloaded from KHN on 2020-10-26 (~9:30 pm EDT)
transmute(State = State, # keeping the State Variable | transmute drops variables that are not in call
County = County, # keeping the County Variable
PercentSeniors = Percent.of.Population.Aged.60., # Shortening Original Variable Name
icuBedsPer10000Residents = 10000 * (ICU.Beds/Total.Population), # Computing icuBedsPer10000Residents
icuBedsPer10000Seniors = 10000 * ICU.Beds/Population.Aged.60.) # Computing icuBedsPer10000Seniors
counties %<>% merge(hospitals, by.x = c("administrative_area_level_2", "administrative_area_level_3"),
by.y = c("State", "County"), all.x = TRUE) # merging the variables into counties based on two keys
# [F] CDC Regions for Each State
regionsCDC = data.frame(States = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island' ,
'Vermont', 'New York', # End of Region A
'Delaware', 'District of Columbia', 'Maryland', 'Pennsylvania',
'Virginia', 'West Virginia', 'New Jersey', # End of Region B
'North Carolina', 'South Carolina', 'Georgia', 'Florida', # Region C
'Kentucky', 'Tennessee', 'Alabama', 'Mississippi', # Region D
'Illinois', 'Indiana', 'Michigan', 'Minnesota', 'Ohio',
'Wisconsin', # End of Region E
'Arkansas', 'Louisiana', 'New Mexico', 'Oklahoma', 'Texas', # Region F
'Iowa', 'Kansas', 'Missouri', 'Nebraska', # Region G
'Colorado', 'Montana', 'North Dakota', 'South Dakota',
'Utah', 'Wyoming', # End of Region H
'Arizona', 'California', 'Hawaii', 'Nevada', # Region I
'Alaska', 'Idaho', 'Oregon', 'Washington' # Region J
),
regions = c(rep('Region.A', 7), rep('Region.B', 7), rep('Region.C', 4),
rep('Region.D', 4), rep('Region.E', 6), rep('Region.F', 5),
rep('Region.G', 4), rep('Region.H', 6), rep('Region.I', 4),
rep('Region.J', 4) ) )
counties %<>% merge(regionsCDC, by.x = 'administrative_area_level_2', by.y = 'States', all.x = TRUE) # merge
# [G] Poverty Estimates
download.file("https://www2.census.gov/programs-surveys/saipe/datasets/2018/2018-state-and-county/est18all.xls",
destfile = "../Data/est18all.xls", mode = "wb") # downloading the data for poverty estimates (latest 2018)
poverty = read_excel("../Data/est18all.xls", skip = 3) %>% # reading the data in R
transmute(key_numeric = paste0(`State FIPS Code`, `County FIPS Code`) %>% as.numeric, # creating the key from two variables
povertyPercent = as.numeric(`Poverty Percent, All Ages`) ) # shortening povertyPercent Variable's Name
counties = merge(counties, poverty, by = "key_numeric", all.x = TRUE) # merge
# Final Transformations before Saving the Counties Data
counties %<>% group_by(id) # Needs to be regrouped again after all the merge steps
counties %<>% mutate_at(.vars = c('governorsParty', 'regions'), as.factor) # converting the two vars to factor
# Saving the data into an RDS file
saveRDS(counties, paste0("../Data/counties.rds"))
# Showing a glimpse of the counties data
glimpse(counties)
## Rows: 739,704
## Columns: 32
## Groups: id [3,108]
## $ key_numeric <int> 1001, 1001, 1001, 1001, 1001, 1...
## $ administrative_area_level_2 <chr> "Alabama", "Alabama", "Alabama"...
## $ administrative_area_level_3 <chr> "Autauga", "Autauga", "Autauga"...
## $ id <chr> "fe5db3f0", "fe5db3f0", "fe5db3...
## $ date <date> 2020-05-17, 2020-06-27, 2020-0...
## $ confirmed <int> 110, 498, 1030, 120, 212, 168, ...
## $ deaths <int> 4, 12, 21, 4, 3, 3, 0, 11, 21, ...
## $ school_closing <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ workplace_closing <fct> 3, 2, 2, 3, 3, 3, 3, 2, 2, 2, 2...
## $ cancel_events <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ gatherings_restrictions <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3...
## $ transport_closing <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ stay_home_restrictions <fct> 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2...
## $ internal_movement_restrictions <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ international_movement_restrictions <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ information_campaigns <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ testing_policy <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3...
## $ contact_tracing <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2...
## $ stringency_index <dbl> 72.69, 68.98, 67.13, 72.69, 72....
## $ key_google_mobility <chr> "Alabama, Autauga County", "Ala...
## $ day <fct> Sun, Sat, Sat, Mon, Fri, Mon, S...
## $ newCases <int> 0, 10, 15, 10, 7, 9, 0, 10, 13,...
## $ newDeaths <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0...
## $ countyType <chr> "Other", "Other", "Other", "Oth...
## $ popDensity <dbl> 93.98594, 93.98594, 93.98594, 9...
## $ percRepVotes <dbl> 72.76659, 72.76659, 72.76659, 7...
## $ governorsParty <fct> Republican, Republican, Republi...
## $ PercentSeniors <dbl> 19.1, 19.1, 19.1, 19.1, 19.1, 1...
## $ icuBedsPer10000Residents <dbl> 1.090196, 1.090196, 1.090196, 1...
## $ icuBedsPer10000Seniors <dbl> 5.701796, 5.701796, 5.701796, 5...
## $ regions <fct> Region.D, Region.D, Region.D, R...
## $ povertyPercent <dbl> 13.8, 13.8, 13.8, 13.8, 13.8, 1...
In this section, we provide the results of some of the exploratory analysis that we have performed on the data obtained from the multiple sources. Given that a large part of this exploratory analysis was performed outside of R (using Tableau), we do not provide any code for the visualizations. Note that the base code for the visualizations performed in R can be obtained from the following GitHub Repo.
To prepare the counties data for time-series clustering, we have made the following transformations:
1. We smoothed the newCases variable within the counties data frame using a 7-day moving average, which also accounts for the day-of-week patterns seen in the data (e.g., smaller numbers of reported cases over the weekend). The resulting data was stored in a variable titled newMA7;
2. We scaled the newMA7 variable such that its values vary between 0 and 1, where a value of 1 denotes that the county is experiencing its maximum newMA7 of confirmed cases over our study period;
3. We pivoted the data into the wide format needed for time-series clustering implementations; and
4. We removed the dates on which the scaled variable was constant across all counties.
clusteringPrep = counties %>% # from the counties
select(id, date, key_google_mobility, newCases) %>% # selecting minimal amount of cols for visual inspection
arrange(id, date) %>% # arranged to ensure correct calculations
mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>% # scaling data to a 0-1 scale by county
select(id, key_google_mobility, date, scaledNewMA7) %>% # dropping newCases
pivot_wider(names_from = date, values_from = scaledNewMA7) # converting the data to a wide format
constantColumns = whichAreConstant(clusteringPrep, verbose = F) # identifying constant columns
paste0(c('The following dates were removed from our data frame since the scaledNewMA7 variable was constant across all counties: ',
colnames(clusteringPrep)[constantColumns]), collapse = ' ')
## [1] "The following dates were removed from our data frame since the scaledNewMA7 variable was constant across all counties: 2020-03-01 2020-03-02 2020-03-03 2020-03-04 2020-03-05 2020-03-06 2020-03-07"
clusteringPrep %<>% select(-all_of(constantColumns) ) %>% # speeds up clustering by dec length of series
as.data.frame() # data needs to be data frame for clustering
row.names(clusteringPrep) = clusteringPrep[,1] # needed for tsclust
clusteringPrep = clusteringPrep[,-1] # dropping the id column since it is now row.name
One of the primary objectives of this paper is to examine how the scaled and smoothed time series of daily new cases (i.e., our scaledNewMA7 variable) per county can be grouped. There have been some signs of a possible ‘first wave’ and ‘second wave’ if one examines the number of new cases per day for the entire US (e.g., see this USAFACTS Chart). From our examination of a sample of individual counties on the same website, it was not clear that this pattern was reflected at the county level. For example, in the visualization of the number of new known cases per day for Kings County, New York, there is limited evidence of any second wave. Hence, in this section, we examine how the individual counties can be grouped based on their scaledNewMA7 time series.
It is important to note that, in our estimation, there are three important decisions to be made when performing time-series clustering: (a) the distance/dissimilarity measure used to compare two series; (b) the clustering algorithm (e.g., hierarchical vs. partitional); and (c) the number of clusters.
In our preliminary analysis, we limit the clustering to the state of Connecticut since it contains only 8 counties. Here, we apply some common distance measures to the scaledNewMA7 series and perform hierarchical clustering on the Connecticut counties to examine the suitability of the different measures based on a visual inspection of the clusters and their associated dendrograms.
ctPrep = clusteringPrep %>% filter(str_detect(key_google_mobility, "Connecticut")) # subsetting to only CT data
row.names(ctPrep) = ctPrep$key_google_mobility %>%
str_remove_all("Connecticut, ") %>%
str_remove_all(" County") # given that we have only one state, we can use just the county name for row names
ctPrep = ctPrep[,-1] # dropping the key_google_mobility column from the data
pr_DB$set_entry(FUN = diss.ACF, names = c("ACFD"),
loop = TRUE, type = "metric", distance = TRUE,
description = "Autocorrelation-based distance") # specifying the distance function based on proxy package
dACF = tsclust(ctPrep, type = "hierarchical", k=2L, distance = "ACFD",
preproc = NULL, seed = 2020) # hierarchical clustering based on the ACFD distance
hclus = cutree(dACF, k = 2) %>% # using a cut tree of 2 to see the order by which counties are grouped
as.data.frame(.) %>%
rename(.,cluster_group = .) %>%
rownames_to_column("County")
hcdata = dendro_data(dACF)
names_order = hcdata$labels$label
p1 = hcdata %>%
ggdendrogram(., rotate=TRUE, leaf_labels=FALSE)
p2 = ctPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
full_join(., hclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group)) %>%
mutate(County = factor(County, levels = rev(as.character(names_order)))) %>%
ggplot(aes(x = Date, y = value, colour = cluster_group, group = County)) +
geom_line()+
facet_wrap(~County, ncol = 1, strip.position="left") +
guides(color=FALSE) +
theme_bw() +
theme(strip.background = element_blank(), strip.text = element_blank())
gp1 = ggplotGrob(p1)
gp2 = ggplotGrob(p2)
grid.arrange(gp2, gp1, ncol=2, widths=c(4,2))
pr_DB$set_entry(FUN = diss.COR, names = c("Cor"),
loop = TRUE, type = "metric", distance = TRUE,
description = "Pearson Correlation-based distance")
cor = tsclust(ctPrep, type = "hierarchical", k=2L, distance = "Cor",
preproc = NULL, seed = 2020)
hclus = cutree(cor, k = 2) %>%
as.data.frame(.) %>%
rename(.,cluster_group = .) %>%
rownames_to_column("County")
hcdata = dendro_data(cor)
names_order = hcdata$labels$label
p1 = hcdata %>%
ggdendrogram(., rotate=TRUE, leaf_labels=FALSE)
p2 = ctPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
full_join(., hclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group)) %>%
mutate(County = factor(County, levels = rev(as.character(names_order)))) %>%
ggplot(aes(x = Date, y = value, colour = cluster_group, group = County)) +
geom_line()+
facet_wrap(~County, ncol = 1, strip.position="left") +
guides(color=FALSE) +
theme_bw() +
theme(strip.background = element_blank(), strip.text = element_blank())
gp1 = ggplotGrob(p1)
gp2 = ggplotGrob(p2)
grid.arrange(gp2, gp1, ncol=2, widths=c(4,2))
dtw = tsclust(ctPrep, type = "hierarchical", k=2L, distance = "dtw", preproc = NULL, seed = 2020)
hclus = cutree(dtw, k = 2) %>%
as.data.frame(.) %>%
rename(.,cluster_group = .) %>%
rownames_to_column("County")
hcdata = dendro_data(dtw)
names_order = hcdata$labels$label
p1 = hcdata %>%
ggdendrogram(., rotate=TRUE, leaf_labels=FALSE)
p2 = ctPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
full_join(., hclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group)) %>%
mutate(County = factor(County, levels = rev(as.character(names_order)))) %>%
ggplot(aes(x = Date, y = value, colour = cluster_group, group = County)) +
geom_line()+
facet_wrap(~County, ncol = 1, strip.position="left") +
guides(color=FALSE) +
theme_bw() +
theme(strip.background = element_blank(), strip.text = element_blank())
gp1 = ggplotGrob(p1)
gp2 = ggplotGrob(p2)
grid.arrange(gp2, gp1, ncol=2, widths=c(4,2))
pr_DB$set_entry(FUN = diss.EUCL, names = c("EUCL"),
loop = TRUE, type = "metric", distance = TRUE,
description = "Euclidean-based distance")
eucl = tsclust(ctPrep, type = "hierarchical", k=2L, distance = "EUCL",
preproc = NULL, seed = 2020)
hclus = cutree(eucl, k = 2) %>%
as.data.frame(.) %>%
rename(.,cluster_group = .) %>%
rownames_to_column("County")
hcdata = dendro_data(eucl)
names_order = hcdata$labels$label
p1 = hcdata %>%
ggdendrogram(., rotate=TRUE, leaf_labels=FALSE)
p2 = ctPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
full_join(., hclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group)) %>%
mutate(County = factor(County, levels = rev(as.character(names_order)))) %>%
ggplot(aes(x = Date, y = value, colour = cluster_group, group = County)) +
geom_line()+
facet_wrap(~County, ncol = 1, strip.position="left") +
guides(color=FALSE) +
theme_bw() +
theme(strip.background = element_blank(), strip.text = element_blank())
gp1 = ggplotGrob(p1)
gp2 = ggplotGrob(p2)
grid.arrange(gp2, gp1, ncol=2, widths=c(4,2))
proxy::pr_DB$set_entry(FUN = diss.PER, names = c("Per"),
loop = TRUE, type = "metric", distance = TRUE,
description = "Periodogram-based distance")
perD = tsclust(ctPrep, type = "hierarchical", k=2L, distance = "Per",
preproc = NULL, seed = 2020)
hclus = cutree(perD, k = 2) %>%
as.data.frame(.) %>%
rename(.,cluster_group = .) %>%
rownames_to_column("County")
hcdata = dendro_data(perD)
names_order = hcdata$labels$label
p1 = hcdata %>%
ggdendrogram(., rotate=TRUE, leaf_labels=FALSE)
p2 = ctPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
full_join(., hclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group)) %>%
mutate(County = factor(County, levels = rev(as.character(names_order)))) %>%
ggplot(aes(x = Date, y = value, colour = cluster_group, group = County)) +
geom_line()+
facet_wrap(~County, ncol = 1, strip.position="left") +
guides(color=FALSE) +
theme_bw() +
theme(strip.background = element_blank(), strip.text = element_blank())
gp1 = ggplotGrob(p1)
gp2 = ggplotGrob(p2)
grid.arrange(gp2, gp1, ncol=2, widths=c(4,2))
Based on the preliminary analysis, the Euclidean distance seems appropriate for our analysis. Therefore, in our full analysis, we will use the Euclidean distance with the \(k\)-means approach as explained earlier in this section.
clusteringPrep %<>% select(-c(key_google_mobility)) # removing this variable so we can cluster
nc = NbClust(clusteringPrep, distance = "euclidean", # euclidean distance
min.nc = 2, max.nc = 50, # searching for optimal k between k=2 and k=50
method = "kmeans", # using the k-means method
index = "all") # using 26 of the 30 indices in the package
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 8 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 29 as the best number of clusters
## * 1 proposed 36 as the best number of clusters
## * 1 proposed 43 as the best number of clusters
## * 1 proposed 50 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
kclus = nc$Best.partition %>% as.data.frame() %>% #obtaining the best partition/ cluster assignment for optimal k
rename(., cluster_group = .) %>% rownames_to_column("County")
#converting the wide to tall data and adding the cluster groupings
clusters = clusteringPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
inner_join(., kclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group))
idClusters = clusters %>% select(c(County, cluster_group)) # creating a look-up table of county and cluster group
colnames(idClusters) = c('id', 'cluster_group') # renaming the columns
idClusters %<>% unique() #removing the duplicates due to different dates (we had that to ensure that the clustering was applied correctly)
# Adding Cluster Grouping to a subset of the counties data frame, which we will use for explanatory modeling
clusterCounties = counties %>%
select(c(id, key_numeric, administrative_area_level_2, administrative_area_level_3,
countyType, popDensity, percRepVotes, governorsParty, PercentSeniors,
icuBedsPer10000Residents, regions, povertyPercent)) %>%
inner_join(., idClusters, by ='id') %>%
unique()
saveRDS(clusterCounties, '../Data/clusterCounties.rds')
In this subsection, we provide seven sets of plots:
All plots can be accessed by toggling through the tabsets below. Note that the third tab contains subtabs (one for each cluster).
# Obtaining the counties with max population density per cluster, in addition to three counties where the authors
# resided (as a sanity check) and Navajo County in Arizona, where COVID-19 outbreaks were reported
PopDensities = clusterCounties %>% ungroup() %>% group_by(cluster_group) %>%
summarise(pd = max(popDensity, na.rm = T))
indices = c(which(clusterCounties$popDensity %in% PopDensities$pd),
which(clusterCounties$key_numeric%in% c(1081, 17119, 39017, 4017)) )
# creating a lookup data frame
lookupDF = clusterCounties[indices, c('id', 'cluster_group')]
# Obtaining national data
usAggregate = covid19(country = 'US', level = 1,
start = '2020-03-01', end = '2020-10-24',
verbose = FALSE)
usAggregate %<>% select(id, key_google_mobility, date, confirmed) %>%
mutate(newCases = c(NA, diff(confirmed)),
newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new cases
maxMA7 = max(newMA7, na.rm = T), # obtaining the max to scale the data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE)) %>%
select(-confirmed) # dropping confirmed so the columns match sampleCounties in rbind() below
sampleCounties = counties %>% # from the counties
filter(id %in% lookupDF$id) %>% # only the 8 counties of interest
select(id, date, key_google_mobility, newCases) %>% # selecting minimal amount of cols for visual inspection
arrange(id, date) %>% # arranged to ensure correct calculations
mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) )
# Joining the US Data with the Sample Counties
sampleCountiesPlusUS = rbind(usAggregate, sampleCounties)
sampleCountiesPlusUS = left_join(sampleCountiesPlusUS, lookupDF, by = 'id')
sampleCountiesPlusUS$cluster_group %<>% as.character() %>% replace_na('Not Clustered') %>% as.factor()
sampleCountiesPlusUS$key_google_mobility %<>% recode(US = 'Aggregate for the Entire US')
colorPal = brewer.pal(n= levels(sampleCountiesPlusUS$cluster_group) %>% length(), 'Set2')
names(colorPal) = levels(sampleCountiesPlusUS$cluster_group)
# Facet Wrap Plot: nonScaledMotivationPlot
sampleCountiesPlusUS %>% ggplot(aes(x = date, y = newMA7, group = id, color = cluster_group)) +
geom_line(size = 1) +
facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
theme_bw(base_size = 11) +
theme(legend.position = 'none') +
labs(color = '', x = 'Month', y = 'New Cases By County') +
scale_color_manual(values = colorPal)
sampleCountiesPlusUS %>% ggplot(aes(x = date, y = scaledNewMA7, group = id, color = cluster_group)) +
geom_line(size = 1) +
facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
theme_bw(base_size = 11) +
theme(legend.position = 'none') +
labs(color = '', x = 'Month', y = 'New Cases By County') +
scale_color_manual(values = colorPal)
for (i in 1:length(levels(clusterCounties$cluster_group))) {
indexesForPlot = clusterCounties %>% filter(cluster_group %in% as.character(i)) %>%
pull(id) %>% sample(9) # pulling nine ids from each cluster
# Printing the tabs
cat(paste('#### Cluster', i, '{-} \n \n'))
# Subsetting and plotting the data
counties %>% # from the counties
select(id, date, newCases, key_google_mobility) %>% # selecting minimal columns
filter(id %in% indexesForPlot & !is.na(key_google_mobility)) %>% # selecting the rows of interest
left_join(clusterCounties[, c('id', 'cluster_group')], by = 'id') %>% # to get clusters
arrange(id, date) %>% # arranged to ensure correct calculations
mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>%
ggplot(aes(x = date, y = scaledNewMA7, group = id, color = cluster_group)) +
geom_line(size = 1) +
facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
theme_bw(base_size = 11) +
theme(legend.position = 'none') +
labs(color = '', x = 'Month', y = 'New Cases By County', caption = paste0('Cluster: ', i)) +
scale_color_manual(values = colorPal) -> cplot
print(cplot) # to print out the graphs
cat('\n \n')
}
spaghettiDF = counties %>% # from the counties
select(id, date, newCases, key_google_mobility) %>% # selecting minimal columns
left_join(clusterCounties[, c('id', 'cluster_group')], by = 'id') %>% # to get clusters
arrange(id, date) %>% # arranged to ensure correct calculations
mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>%
ungroup() %>% select(date, cluster_group, scaledNewMA7, key_google_mobility) %>%
group_by(date, cluster_group)
# to have nice panels names but unfortunately will require a new colorPanel
levels(spaghettiDF$cluster_group) = paste('Cluster', levels(spaghettiDF$cluster_group))
colorPal2 = brewer.pal(n= levels(spaghettiDF$cluster_group) %>% length(), 'Set2')
names(colorPal2) = levels(spaghettiDF$cluster_group)
spaghettiDF %>% ggplot(aes(x = date, y = scaledNewMA7, color = cluster_group, group = key_google_mobility)) +
geom_line(size = 0.25, alpha = 0.1) +
stat_summary(aes(group = 1),
fun= median,
geom = "line",
size = 1.25, col = 'black') +
facet_wrap(~ cluster_group, ncol = 2) +
theme_bw(base_size = 9.25) +
theme(legend.position = 'none') +
labs(x = 'Month', y = 'Scaled New Cases By Cluster By Day',
caption = 'Solid black line represents the median for each cluster | Based on Data from Mar 01, 2020 - Oct 24, 2020') +
scale_color_manual(values = colorPal2)